-
Notifications
You must be signed in to change notification settings - Fork 536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for pre-tokenized streaming dataset finetuning #601
Conversation
Thanks! Will take a look after we resolve #600 |
4350de0
to
382c88a
Compare
This patch fixes a pyright string concatenation warning and also adds typing information where necessary.
dcc6e43
to
1831916
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not opposed to supporting pre-tokenized streaming datasets. But, I am concerned that the tokens/labels structure used to identify pre-tokenized formats is too arbitrary. It's an issue mostly because we don't have any tools that someone can use to build a streaming dataset with the tokens/labels structure.
Rather than having a "backdoor" in StreamingFinetuningDataset
that recognizes pre-tokenized formats from the tokens/labels structure, I'd prefer that we keep with the prompt/response structure and simply allow the code to determine whether tokenization is required based on whether the prompt/response values are strings or bytes. I think this will make it easier to simply add a pre-tokenization option to our existing tooling.
Please let me know if this ask is unclear. Thanks for helping to grow our codebase!
That's a pretty reasonable request, especially given there's nothing upstream using this! I'll change the implementation to follow your suggestion. I'm planning on writing up some tooling eventually designed for upstream use that uses this format (probably as a part of #611), but that'll come later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this and incorporating the suggestions! LGTM!!
This has been done by #945 |
This PR adds in support for performing fine-tuning training with a streaming dataset that has already been pre-tokenized.
The test passing is dependent on the fix in #600.